{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text preprocessing\n",
    "- Basic text preprocessing using the Keras API\n",
    "- Doc: https://keras.io/preprocessing/text/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from keras.preprocessing.text import Tokenizer, text_to_word_sequence, one_hot\n",
    "from keras.preprocessing.sequence import pad_sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenization of a sentence\n",
    "- Tokenization: the process of converting a sequence of characters into a sequence of tokens (https://en.wikipedia.org/wiki/Lexical_analysis#Token)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sentences = ['Curiosity killed the cat.', 'But satisfaction brought it back']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tk = Tokenizer()    # create Tokenizer instance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tk.fit_on_texts(sentences)    # fit the tokenizer on the texts first; this builds its word index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Converting sentence into (integer) sequence\n",
    "- One of simple ways of modeling text is to create sequence of integers for each sentence\n",
    "- By doing so, information regarding order of words can be preserved"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1, 2, 3, 4], [5, 6, 7, 8, 9]]\n"
     ]
    }
   ],
   "source": [
    "seq = tk.texts_to_sequences(sentences)\n",
    "print(seq)"
   ]
  },
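  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The integers above are looked up in a word index that the tokenizer builds while fitting. As a rough illustration only, the pure-Python sketch below imitates that step: it lowercases the text, replaces ASCII punctuation with spaces, splits on whitespace, and numbers words from 1 by descending frequency (ties keep first-seen order). `tokenize` and `build_word_index` are hypothetical helpers, not Keras functions, and the filtering rules Keras applies differ in detail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import string\n",
    "from collections import Counter\n",
    "\n",
    "def tokenize(text):\n",
    "    # hypothetical helper: lowercase, turn ASCII punctuation into spaces, split on whitespace\n",
    "    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))\n",
    "    return text.lower().translate(table).split()\n",
    "\n",
    "def build_word_index(texts):\n",
    "    # hypothetical helper: count words, then number them from 1 by descending frequency\n",
    "    counts = Counter(word for text in texts for word in tokenize(text))\n",
    "    # most_common() sorts by count; words with equal counts keep first-seen order\n",
    "    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}\n",
    "\n",
    "demo_texts = ['Curiosity killed the cat.', 'But satisfaction brought it back']\n",
    "demo_index = build_word_index(demo_texts)\n",
    "demo_seq = [[demo_index[word] for word in tokenize(text)] for text in demo_texts]\n",
    "print(demo_index)\n",
    "print(demo_seq)"
   ]
  },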
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### One-hot encoding of sentence\n",
    "- Sometimes only whether a word appears in a sentence matters, not where or how often\n",
    "- This representation is often called \"one-hot encoding\"; for a whole sentence it is, more precisely, a binary bag-of-words (multi-hot) vector\n",
    "    - If a word appears in the sentence, its column is encoded as **\"one\"**\n",
    "    - If not, its column is encoded as **\"zero\"**\n",
    "- Column 0 is always zero here because the tokenizer never assigns index 0 to a word"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.  1.  1.  1.  1.  0.  0.  0.  0.  0.]\n",
      " [ 0.  0.  0.  0.  0.  1.  1.  1.  1.  1.]]\n"
     ]
    }
   ],
   "source": [
    "mat = tk.sequences_to_matrix(seq)\n",
    "print(mat)"
   ]
  },
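  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the matrix above concrete, here is a minimal pure-Python sketch of the same binary encoding. `to_binary_matrix` is a hypothetical helper, not a Keras function; it assumes word indices start at 1, so column 0 stays unused."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def to_binary_matrix(sequences, num_words):\n",
    "    # hypothetical helper: one row per sentence, one column per word index\n",
    "    matrix = []\n",
    "    for sequence in sequences:\n",
    "        row = [0] * num_words\n",
    "        for index in sequence:\n",
    "            row[index] = 1    # record presence only; word order and counts are discarded\n",
    "        matrix.append(row)\n",
    "    return matrix\n",
    "\n",
    "demo_seq = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]\n",
    "demo_mat = to_binary_matrix(demo_seq, num_words=10)    # 9 distinct words plus the unused column 0\n",
    "for row in demo_mat:\n",
    "    print(row)"
   ]
  },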
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Padding sequences\n",
    "- Oftentimes, to preserve the dimensionality of sentences, zero padding is performed\n",
    "- Idea is similar to that of padding exterior of image-format data, but applied to sequences"
   ]
  },
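  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before using `pad_sequences`, the idea can be sketched in plain Python. `pad_to_longest` is a hypothetical helper that pads every sequence to the length of the longest one, either at the start (`'pre'`) or at the end (`'post'`); it ignores truncation and the other options a real implementation would need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pad_to_longest(sequences, padding='pre', value=0):\n",
    "    # hypothetical helper: pad every sequence to the length of the longest one\n",
    "    if padding not in ('pre', 'post'):\n",
    "        raise ValueError('padding must be pre or post')\n",
    "    max_len = max(len(sequence) for sequence in sequences)\n",
    "    padded = []\n",
    "    for sequence in sequences:\n",
    "        fill = [value] * (max_len - len(sequence))\n",
    "        # 'pre' puts the fill before the sequence, 'post' puts it after\n",
    "        padded.append(fill + list(sequence) if padding == 'pre' else list(sequence) + fill)\n",
    "    return padded\n",
    "\n",
    "demo_seq = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]\n",
    "print(pad_to_longest(demo_seq, padding='pre'))\n",
    "print(pad_to_longest(demo_seq, padding='post'))"
   ]
  },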
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1 2 3 4]\n",
      " [5 6 7 8 9]]\n"
     ]
    }
   ],
   "source": [
    "# padding='pre' (the default) inserts zeros at the start of shorter sequences\n",
    "pad_seq = pad_sequences(seq, padding='pre')\n",
    "print(pad_seq)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 2 3 4 0]\n",
      " [5 6 7 8 9]]\n"
     ]
    }
   ],
   "source": [
    "# padding='post' appends zeros to the end of shorter sequences\n",
    "pad_seq = pad_sequences(seq, padding='post')\n",
    "print(pad_seq)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
